xgboost model
Machine learning in an expectation-maximisation framework for nowcasting
Wilsens, Paul, Antonio, Katrien, Claeskens, Gerda
Decision making often occurs in the presence of incomplete information, leading to the under- or overestimation of risk. Leveraging the observable information to learn the complete information is called nowcasting. In practice, incomplete information is often a consequence of reporting or observation delays. In this paper, we propose an expectation-maximisation (EM) framework for nowcasting that uses machine learning techniques to model both the occurrence as well as the reporting process of events. We allow for the inclusion of covariate information specific to the occurrence and reporting periods as well as characteristics related to the entity for which events occurred. We demonstrate how the maximisation step and the information flow between EM iterations can be tailored to leverage the predictive power of neural networks and (extreme) gradient boosting machines (XGBoost). With simulation experiments, we show that we can effectively model both the occurrence and reporting of events when dealing with high-dimensional covariate information. In the presence of non-linear effects, we show that our methodology outperforms existing EM-based nowcasting frameworks that use generalised linear models in the maximisation step. Finally, we apply the framework to the reporting of Argentinian Covid-19 cases, where the XGBoost-based approach again is most performant.
- Europe > Belgium > Flanders > Flemish Brabant > Leuven (0.04)
- South America > Argentina > Pampas > Buenos Aires F.D. > Buenos Aires (0.04)
- Oceania > New Zealand (0.04)
- (2 more...)
- Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (1.00)
- Health & Medicine > Epidemiology (1.00)
- Banking & Finance > Insurance (1.00)
- Health & Medicine > Therapeutic Area > Immunology (0.88)
Lightweight ML-Based Air Quality Prediction for IoT and Embedded Applications
Sami, Md. Sad Abdullah, Abid, Mushfiquzzaman
This study investigates the effectiveness and efficiency of two variants of the XGBoost regression model, the full-capacity and lightweight (tiny) versions, for predicting the concentrations of carbon monoxide (CO) and nitrogen dioxide (NO2). Using the AirQualityUCI dataset collected over one year in an urban environment, we conducted a comprehensive evaluation based on widely accepted metrics, including Mean Absolute Error (MAE), Root Mean Square Error (RMSE), Mean Bias Error (MBE), and the coefficient of determination (R2). In addition, we assessed resource-oriented metrics such as inference time, model size, and peak RAM usage. The full XGBoost model achieved superior predictive accuracy for both pollutants, while the tiny model, though slightly less precise, offered substantial computational benefits with significantly reduced inference time and model storage requirements. These results demonstrate the feasibility of deploying simplified models in resource-constrained environments without compromising predictive quality. This makes the tiny XGBoost model suitable for real-time air-quality monitoring in IoT and embedded applications.
- Europe > Italy (0.05)
- Asia > Bangladesh > Dhaka Division > Dhaka District > Dhaka (0.05)
Towards Carbon-Aware Container Orchestration: Predicting Workload Energy Consumption with Federated Learning
Saad, Zainab, Yang, Jialin, Leung, Henry, Drew, Steve
The growing reliance on large-scale data centers to run resource-intensive workloads has significantly increased the global carbon footprint, underscoring the need for sustainable computing solutions. While container orchestration platforms like Kubernetes help optimize workload scheduling to reduce carbon emissions, existing methods often depend on centralized machine learning models that raise privacy concerns and struggle to generalize across diverse environments. In this paper, we propose a federated learning approach for energy consumption prediction that preserves data privacy by keeping sensitive operational data within individual enterprises. By extending the Kubernetes Efficient Power Level Exporter (Kepler), our framework trains XGBoost models collaboratively across distributed clients using Flower's FedXgbBagging aggregation using a bagging strategy, eliminating the need for centralized data sharing. Experimental results on the SPECPower benchmark dataset show that our FL-based approach achieves 11.7 percent lower Mean Absolute Error compared to a centralized baseline. This work addresses the unresolved trade-off between data privacy and energy prediction efficiency in prior systems such as Kepler and CASPER and offers enterprises a viable pathway toward sustainable cloud computing without compromising operational privacy.
- North America > Canada > Alberta > Census Division No. 6 > Calgary Metropolitan Region > Calgary (0.14)
- North America > United States (0.04)
- Information Technology > Security & Privacy (1.00)
- Energy (1.00)
Fusing Sequence Motifs and Pan-Genomic Features: Antimicrobial Resistance Prediction using an Explainable Lightweight 1D CNN-XGBoost Ensemble
Siddiqui, Md. Saiful Bari, Tarannum, Nowshin
Antimicrobial Resistance (AMR) is a rapidly escalating global health crisis. While genomic sequencing enables rapid prediction of resistance phenotypes, current computational methods have limitations. Standard machine learning models treat the genome as an unordered collection of features, ignoring the sequential context of Single Nucleotide Polymorphisms (SNPs). State-of-the-art sequence models like Transformers are often too data-hungry and computationally expensive for the moderately-sized datasets that are typical in this domain. To address these challenges, we propose AMR-EnsembleNet, an ensemble framework that synergistically combines sequence-based and feature-based learning. We developed a lightweight, custom 1D Convolutional Neural Network (CNN) to efficiently learn predictive sequence motifs from high-dimensional SNP data. This sequence-aware model was ensembled with an XGBoost model, a powerful gradient boosting system adept at capturing complex, non-local feature interactions. We trained and evaluated our framework on a benchmark dataset of 809 E. coli strains, predicting resistance across four antibiotics with varying class imbalance. Our 1D CNN-XGBoost ensemble consistently achieved top-tier performance across all the antibiotics, reaching a Matthews Correlation Coefficient (MCC) of 0.926 for Ciprofloxacin (CIP) and the highest Macro F1-score of 0.691 for the challenging Gentamicin (GEN) AMR prediction. We also show that our model consistently focuses on SNPs within well-known AMR genes like fusA and parC, confirming it learns the correct genetic signals for resistance. Our work demonstrates that fusing a sequence-aware 1D CNN with a feature-based XGBoost model creates a powerful ensemble, overcoming the limitations of using either an order-agnostic or a standalone sequence model.
- Asia > Bangladesh > Dhaka Division > Dhaka District > Dhaka (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > Arizona (0.04)
- Asia > Singapore (0.04)
- Research Report (0.64)
- Overview (0.46)
Graph Transformer-Based Flood Susceptibility Mapping: Application to the French Riviera and Railway Infrastructure Under Climate Change
Vemula, Sreenath, Gatti, Filippo, Jehel, Pierre
Increasing flood frequency and severity due to climate change threatens infrastructure and demands improved susceptibility mapping techniques. While traditional machine learning (ML) approaches are widely used, they struggle to capture spatial dependencies and poor boundary delineation between susceptibility classes. This study introduces the first application of a graph transformer (GT) architecture for flood susceptibility mapping to the flood-prone French Riviera (e.g., 2020 Storm Alex) using topography, hydrology, geography, and environmental data. GT incorporates watershed topology using Laplacian positional encoders (PEs) and attention mechanisms. The developed GT model has an AUC-ROC (0.9739), slightly lower than XGBoost (0.9853). However, the GT model demonstrated better clustering and delineation with a higher Moran's I value (0.6119) compared to the random forest (0.5775) and XGBoost (0.5311) with p-value lower than 0.0001. Feature importance revealed a striking consistency across models, with elevation, slope, distance to channel, and convergence index being the critical factors. Dimensionality reduction on Laplacian PEs revealed partial clusters, indicating they could capture spatial information; however, their importance was lower than flood factors. Since climate and land use changes aggravate flood risk, susceptibility maps are developed for the 2050 year under different Representative Concentration Pathways (RCPs) and railway track vulnerability is assessed. All RCP scenarios revealed increased area across susceptibility classes, except for the very low category. RCP 8.5 projections indicate that 17.46% of the watershed area and 54% of railway length fall within very-high susceptible zones, compared to 6.19% and 35.61%, respectively, under current conditions. The developed maps can be integrated into a multi-hazard framework.
- North America > Canada (0.04)
- Europe > Italy (0.04)
- Europe > France > Provence-Alpes-Côte d'Azur > Alpes-de-Haute-Provence (0.04)
- (6 more...)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- (2 more...)
Centralized vs. Federated Learning for Educational Data Mining: A Comparative Study on Student Performance Prediction with SAEB Microdata
The application of data mining and artificial intelligence in education offers unprecedented potential for personalizing learning and early identification of at-risk students. However, the practical use of these techniques faces a significant barrier in privacy legislation, such as Brazil's General Data Protection Law (LGPD), which restricts the centralization of sensitive student data. To resolve this challenge, privacy-preserving computational approaches are required. The present study evaluates the feasibility and effectiveness of Federated Learning, specifically the FedProx algorithm, to predict student performance using microdata from the Brazilian Basic Education Assessment System (SAEB). A Deep Neural Network (DNN) model was trained in a federated manner, simulating a scenario with 50 schools, and its performance was rigorously benchmarked against a centralized eXtreme Gradient Boosting (XGBoost) model. The analysis, conducted on a universe of over two million student records, revealed that the centralized model achieved an accuracy of 63.96%. Remarkably, the federated model reached a peak accuracy of 61.23%, demonstrating a marginal performance loss in exchange for a robust privacy guarantee. The results indicate that Federated Learning is a viable and effective solution for building collaborative predictive models in the Brazilian educational context, in alignment with the requirements of the LGPD.
- South America > Brazil > Rio Grande do Norte (0.04)
- Asia > Middle East > Saudi Arabia (0.04)
- Research Report > New Finding (1.00)
- Workflow (0.93)
- Information Technology > Security & Privacy (1.00)
- Education > Assessment & Standards > Student Performance (1.00)
- Education > Educational Setting > Higher Education (0.93)
Machine Learning-Based Manufacturing Cost Prediction from 2D Engineering Drawings via Geometric Features
Arıkan, Ahmet Bilal, Özönder, Şener, Koçyiğit, Mustafa Taha, Altun, Hüseyin Oktay, Küçükkartal, H. Kübra, Arslanoğlu, Murat, Çağırankaya, Fatih, Ayvaz, Berk
We present an integrated machine learning framework that transforms how manufacturing cost is estimated from 2D engineering drawings. Unlike traditional quotation workflows that require labor-intensive process planning, our approach about 200 geometric and statistical descriptors directly from 13,684 DWG drawings of automotive suspension and steering parts spanning 24 product groups. Gradient-boosted decision tree models (XGBoost, CatBoost, LightGBM) trained on these features achieve nearly 10% mean absolute percentage error across groups, demonstrating robust scalability beyond part-specific heuristics. By coupling cost prediction with explainability tools such as SHAP, the framework identifies geometric design drivers including rotated dimension maxima, arc statistics and divergence metrics, offering actionable insights for cost-aware design. This end-to-end CAD-to-cost pipeline shortens quotation lead times, ensures consistent and transparent cost assessments across part families and provides a deployable pathway toward real-time, ERP-integrated decision support in Industry 4.0 manufacturing environments.
- Europe > Middle East > Republic of Türkiye > Istanbul Province > Istanbul (0.05)
- Asia > Middle East > Republic of Türkiye > Istanbul Province > Istanbul (0.05)
- North America > United States > New York > New York County > New York City (0.04)
- (3 more...)
- Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
Cross-Domain Web Information Extraction at Pinterest
Farag, Michael, Halina, Patrick, Zaytsev, Andrey, Munagala, Alekhya, Ahmed, Imtihan, Wang, Junhao
The internet offers a massive repository of unstructured information, but it's a significant challenge to convert this into a structured format. At Pinterest, the ability to accurately extract structured product data from e-commerce websites is essential to enhance user experiences and improve content distribution. In this paper, we present Pinterest's system for attribute extraction, which achieves remarkable accuracy and scalability at a manageable cost. Our approach leverages a novel webpage representation that combines structural, visual, and text modalities into a compact form, optimizing it for small model learning. This representation captures each visible HTML node with its text, style and layout information. We show how this allows simple models such as eXtreme Gradient Boosting (XGBoost) to extract attributes more accurately than much more complex Large Language Models (LLMs) such as Generative Pre-trained Transformer (GPT). Our results demonstrate a system that is highly scalable, processing over 1,000 URLs per second, while being 1000 times more cost-effective than the cheapest GPT alternatives.
- North America > United States > California > San Francisco County > San Francisco (0.14)
- North America > Canada > Ontario > Toronto (0.06)
- North America > United States > New York > New York County > New York City (0.05)
- Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
- Information Technology > Communications > Social Media (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Information Extraction (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.95)
Exploring molecular assembly as a biosignature using mass spectrometry and machine learning
Rutter, Lindsay A., Sharma, Abhishek, Seet, Ian, Alobo, David Obeh, Goto, An, Cronin, Leroy
Molecular assembly offers a promising path to detect life beyond Earth, while minimizing assumptions based on terrestrial life. As mass spectrometers will be central to upcoming Solar System missions, predicting molecular assembly from their data without needing to elucidate unknown structures will be essential for unbiased life detection. An ideal agnostic biosignature must be interpretable and experimentally measurable. Here, we show that molecular assembly, a recently developed approach to measure objects that have been produced by evolution, satisfies both criteria. First, it is interpretable for life detection, as it reflects the assembly of molecules with their bonds as building blocks, in contrast to approaches that discount construction history. Second, it can be determined without structural elucidation, as it can be physically measured by mass spectrometry, a property that distinguishes it from other approaches that use structure-based information measures for molecular complexity. Whilst molecular assembly is directly measurable using mass spectrometry data, there are limits imposed by mission constraints. To address this, we developed a machine learning model that predicts molecular assembly with high accuracy, reducing error by three-fold compared to baseline models. Simulated data shows that even small instrumental inconsistencies can double model error, emphasizing the need for standardization. These results suggest that standardized mass spectrometry databases could enable accurate molecular assembly prediction, without structural elucidation, providing a proof-of-concept for future astrobiology missions.
- North America > United States > California > San Francisco County > San Francisco (0.14)
- Europe > Netherlands > South Holland > Dordrecht (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Energy (0.95)
- Health & Medicine > Pharmaceuticals & Biotechnology (0.46)
Interpretable Low-Dimensional Modeling of Spatiotemporal Agent States for Decision Making in Football Tactics
Ide, Kenjiro, Someya, Taiga, Kawaguchi, Kohei, Fujii, Keisuke
Understanding football tactics is crucial for managers and analysts. Previous research has proposed models based on spatial and kinematic equations, but these are computationally expensive. Also, Reinforcement learning approaches use player positions and velocities but lack interpretability and require large datasets. Rule-based models align with expert knowledge but have not fully considered all players' states. This study explores whether low-dimensional, rule-based models using spatiotemporal data can effectively capture football tactics. Our approach defines interpretable state variables for both the ball-holder and potential pass receivers, based on criteria that explore options like passing. Through discussions with a manager, we identified key variables representing the game state. We then used StatsBomb event data and SkillCorner tracking data from the 2023$/$24 LaLiga season to train an XGBoost model to predict pass success. The analysis revealed that the distance between the player and the ball, as well as the player's space score, were key factors in determining successful passes. Our interpretable low-dimensional modeling facilitates tactical analysis through the use of intuitive variables and provides practical value as a tool to support decision-making in football.